Aerial image classification remains an open challenge because aerial images are complex and highly varied. To address this, this paper proposes two hybrid models for classification. The first combines features extracted from ResNet-50 and the Vision Transformer (ViT), then applies multi-head attention (MHA) to select the most informative features. The second also extracts features from ResNet-50 and ViT, then applies cross-attention. Both models are evaluated on the benchmark Sikkim Aerial Images Dataset for Object Detection (SAIOD) using well-established performance metrics, including precision, recall, F1-score, and the ROC curve. The MHA-based model performs best, with an accuracy of 95.80%, while the cross-attention model reaches 95.52%; both outperform the best existing methods.
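
To make the difference between the two fusion strategies concrete, the following is a minimal NumPy sketch, not the implementation evaluated in this paper. It assumes that both backbones' features have already been projected to a shared width (`d_model = 64` here), uses random arrays as stand-ins for the ResNet-50 and ViT outputs, and uses untrained random projection weights. The token counts (49 and 197), the number of heads, the choice to concatenate tokens before MHA, and the choice of ResNet-50 tokens as queries in cross-attention are all illustrative assumptions; the abstract does not specify them.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(query, key_value, num_heads, rng):
    """Scaled dot-product attention with several heads.

    query:     (n_q, d) tokens that ask the questions
    key_value: (n_kv, d) tokens that are attended to
    Returns an array of shape (n_q, d).
    The projection weights are random (hypothetical, untrained).
    """
    n_q, d = query.shape
    assert d % num_heads == 0, "d must be divisible by num_heads"
    d_head = d // num_heads
    w_q, w_k, w_v, w_o = (
        rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4)
    )

    def split(x):
        # (n, d) -> (heads, n, d_head)
        return x.reshape(x.shape[0], num_heads, d_head).transpose(1, 0, 2)

    q = split(query @ w_q)
    k = split(key_value @ w_k)
    v = split(key_value @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, n_q, n_kv)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (heads, n_q, d_head)
    merged = heads.transpose(1, 0, 2).reshape(n_q, d)     # concatenate heads
    return merged @ w_o


rng = np.random.default_rng(0)
d_model = 64  # assumed shared feature width after projection

# Random stand-ins for backbone outputs (assumed shapes):
cnn_tokens = rng.standard_normal((49, d_model))   # e.g. a 7x7 CNN feature map
vit_tokens = rng.standard_normal((197, d_model))  # e.g. 196 patches + 1 class token

# Variant 1 (assumed): concatenate both token sets, then self-attend with MHA.
combined = np.concatenate([cnn_tokens, vit_tokens], axis=0)
fused_mha = multi_head_attention(combined, combined, num_heads=8, rng=rng)

# Variant 2 (assumed): cross-attention, CNN tokens query the ViT tokens.
fused_cross = multi_head_attention(cnn_tokens, vit_tokens, num_heads=8, rng=rng)

print(fused_mha.shape, fused_cross.shape)
```

In this sketch the only structural difference is where the queries and keys come from: in the first variant every token attends over the joint set of both backbones, while in the second the tokens of one backbone attend only over the tokens of the other. A classifier head would then pool the fused tokens, which is omitted here.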